Robust Control of the Multi-armed Bandit Problem
Abstract
We study a robust model of the multi-armed bandit (MAB) problem in which the transition probabilities are ambiguous and belong to subsets of the probability simplex. We first show that, for each arm, there exists a robust counterpart of the Gittins index that is the solution to a robust optimal-stopping problem. We then characterize the optimal policy of the robust MAB as a project-by-project retirement policy; however, we show that the arms become dependent under ambiguity, so the policy based on the robust Gittins index is not optimal. For a project selection problem, we show that the robust Gittins index policy is near-optimal, but its implementation requires more computational effort than solving a non-robust MAB problem. Hence, we propose a Lagrangian index policy that requires the same computational effort as evaluating the indices of a non-robust MAB and at the same time provides a near-optimal solution to the robust project selection problem.
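The abstract describes the robust Gittins index as the solution of a robust optimal-stopping problem. The paper's own construction is not reproduced here; the following Python sketch only illustrates the general idea under simplifying assumptions: a single finite-state arm with reward vector r, discount factor beta, and an interval ambiguity set of radius eps around each nominal transition row. The greedy inner solver and every name in the code are illustrative choices, not the authors' method.

```python
import numpy as np

def worst_case_expectation(p, v, eps):
    """Minimal expectation of v over distributions q with
    max(p - eps, 0) <= q <= min(p + eps, 1) and sum(q) == 1.
    Greedy: push as much mass as possible onto low-value states."""
    lo = np.maximum(p - eps, 0.0)
    hi = np.minimum(p + eps, 1.0)
    q = lo.copy()
    budget = 1.0 - q.sum()
    for s in np.argsort(v):              # fill lowest-value states first
        add = min(hi[s] - q[s], budget)
        q[s] += add
        budget -= add
        if budget <= 1e-12:
            break
    return q @ v

def robust_gittins_index(r, P, beta, eps, state, tol=1e-6):
    """Robust counterpart of the Gittins index of `state` via the retirement
    (optimal stopping) formulation: binary-search the retirement reward M;
    for each M, solve the robust stopping problem by value iteration with a
    worst-case Bellman operator."""
    n = len(r)
    M_lo, M_hi = r.min() / (1 - beta), r.max() / (1 - beta)
    while M_hi - M_lo > tol:
        M = 0.5 * (M_lo + M_hi)
        V = np.full(n, M)
        for _ in range(2000):            # robust value iteration
            cont = np.array([r[s] + beta * worst_case_expectation(P[s], V, eps)
                             for s in range(n)])
            V_new = np.maximum(M, cont)
            if np.max(np.abs(V_new - V)) < tol:
                break
            V = V_new
        if V[state] > M + tol:           # continuing still beats retiring
            M_lo = M
        else:
            M_hi = M
    return (1 - beta) * M_hi
```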
Similar resources
On Robust Arm-Acquiring Bandit Problems
In the classical multi-armed bandit problem, at each stage the player has to choose one of N given projects (arms) to generate a reward that depends on the arm played and its current state. The state process of each arm is modeled by a Markov chain whose transition probabilities are known a priori. The goal of the player is to maximize the expected total reward. One variant of the problem, the so...
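For illustration, a minimal simulation of the classical index policy this setting admits (at each stage, play the arm whose current state carries the largest index; only the played arm changes state) might look as follows. The per-state indices are assumed precomputed, for example by a retirement calibration like the sketch above; all names here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def play_index_policy(arms, indices, horizon):
    """Simulate a classical MAB under an index policy.
    `arms` is a list of (r, P) pairs (reward vector, transition matrix);
    `indices[i][s]` is the index of arm i in state s."""
    states = [0] * len(arms)
    total = 0.0
    for _ in range(horizon):
        i = max(range(len(arms)), key=lambda k: indices[k][states[k]])
        r, P = arms[i]
        total += r[states[i]]
        states[i] = rng.choice(len(r), p=P[states[i]])   # only arm i moves
    return total
```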
Contributions to the Asymptotic Minimax Theorem for the Two-Armed Bandit Problem
The asymptotic minimax theorem for the Bernoulli two-armed bandit problem states that the minimax risk has the order √N as N → ∞, where N is the control horizon, and provides lower and upper estimates. It can be easily extended to the normal two-armed bandit. For the normal two-armed bandit, we generalize the asymptotic minimax theorem as follows: the minimax risk is approximately equal to 0.637√N as N → ∞. ...
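In display form, the generalized claim reads roughly as follows; the notation is ours, not the authors':

```latex
% Hedged formalization of the cited claim. R_N(\pi,\theta) denotes the
% risk of strategy \pi against parameter \theta over control horizon N.
\[
  \inf_{\pi}\,\sup_{\theta}\, R_N(\pi,\theta) \;\approx\; 0.637\,\sqrt{N}
  \qquad (N \to \infty)
\]
```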
Deviations of Stochastic Bandit Regret
This paper studies the deviations of the regret in a stochastic multi-armed bandit problem. When the total number of plays n is known beforehand by the agent, Audibert et al. (2009) exhibit a policy such that, with probability at least 1 − 1/n, the regret of the policy is of order log n. They have also shown that such a property is not shared by the popular UCB1 policy of Auer et al. (2002). This ...
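For reference, the UCB1 policy of Auer et al. (2002) mentioned above admits a compact implementation. This is the standard textbook form, not code from the cited paper; `pull` is an assumed callback returning a reward in [0, 1].

```python
import math

def ucb1(pull, n_arms, horizon):
    """UCB1: play each arm once, then always pull the arm maximizing
    empirical mean + sqrt(2 ln t / pulls)."""
    counts = [0] * n_arms
    means = [0.0] * n_arms
    for t in range(1, horizon + 1):
        if t <= n_arms:
            a = t - 1                     # initialization: one pull per arm
        else:
            a = max(range(n_arms),
                    key=lambda i: means[i] + math.sqrt(2 * math.log(t) / counts[i]))
        x = pull(a)                       # observe a bounded reward
        counts[a] += 1
        means[a] += (x - means[a]) / counts[a]   # incremental mean update
    return means, counts
```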
Two-Armed Bandit Problem, Data Processing, and Parallel Version of the Mirror Descent Algorithm
We consider the minimax setup for the two-armed bandit problem as applied to data processing when there are two alternative processing methods available with different, a priori unknown efficiencies. One should determine the most effective method and ensure its predominant application. To this end we use the mirror descent algorithm (MDA). It is well known that the corresponding minimax risk has the ...
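The paper's parallel MDA variant is not reproduced here. As a rough sketch, entropic mirror descent on a two-armed bandit reduces to an EXP3-style exponential-weights update; the step size `gamma` and the `pull` callback (returning a loss in [0, 1]) are illustrative assumptions.

```python
import math
import random

def mirror_descent_two_armed(pull, horizon, gamma=0.05):
    """Entropic mirror descent (EXP3-style) for a two-armed bandit:
    sample an arm from the current distribution, form an importance-weighted
    loss estimate, then take a multiplicative-weights step."""
    w = [1.0, 1.0]
    for _ in range(horizon):
        total = w[0] + w[1]
        p = [(1 - gamma) * w[i] / total + gamma / 2 for i in range(2)]
        a = 0 if random.random() < p[0] else 1
        loss = pull(a)                    # loss of the played arm only
        est = loss / p[a]                 # unbiased importance-weighted estimate
        w[a] *= math.exp(-gamma * est / 2)
    return w
```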
Multi-Armed Bandit Policies for Reputation Systems
The robustness of reputation systems against manipulation has been widely studied. However, studies of how to use the reputation values computed by those systems are rare. In this paper, we draw an analogy between reputation systems and multi-armed bandit problems. We investigate how to use multi-armed bandit selection policies in order to increase the robustness of reputation systems ...
Publication date: 2014